STATISTICAL ANALYSIS OF DATA ON CLASSIFICATION
OF RADIOGRAPHS OF ALUMINUM ALLOY SPECIMENS


For each of four aluminum alloys, specimens were classified
by three readers according to the amount of a dispersed
defect. Classification was made by comparison of radiographs
with standard radiographs representing 1, 2, 3, etc., "degrees"
of the defect.

In this report, the data on these classifications are
analyzed to determine certain properties of the method of
classification. The conclusions and summary tables are
presented in six sections:

1) Definition of "Consensus".

2) Proportion of Scores Within ± 1 Degree of
Consensus.

3) Differences Among Readers.

4) Variability of Interpretations.

5) Analysis of Repeated Readings.

6) Suggestions for Further Experiments.


1. Definition of "Consensus".

For the occasional cases where differences in interpretation
exceeded two degrees, the rule for consensus assignment
followed in your worksheets did not seem to be well defined.

The rule we adopted was to define the consensus as the
median of the three scores. Thus, if two or more observers
agree, the degree they assign is the median, or consensus.
If each judge assigns a different degree to a radiograph,
the middle number is the median. In the few cases where use
of the median led to a change in the value given for the
consensus, the revised figure is noted (in purple) on the
original data sheets.
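
The median rule can be stated compactly in code. The following is a minimal sketch in Python; the function name `consensus` and the example scores are illustrative assumptions and do not come from the worksheets.

```python
from statistics import median

def consensus(scores):
    """Consensus degree for one radiograph: the median of the three scores."""
    a, b, c = scores  # exactly three readers are assumed
    return int(median((a, b, c)))

# Two readers agree: their common score is the consensus.
print(consensus((2, 2, 5)))
# All three differ: the middle score is the consensus.
print(consensus((1, 3, 6)))
```

With three scores the median always coincides with one of the assigned degrees, so the consensus is always a whole degree.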

2. Proportion of Scores Within ± 1 Degree of Consensus.

From the data which were obtained, one may calculate the
proportion of cases in which a reader assigned a score within
± 1 degree of the consensus score of three readers. This may
be interpreted as an estimate of the average proportion of
assignments within ± 1 of consensus which would be obtained
in hypothetical repetitions of the same experiment (same
readers, same set of radiographs). The extent to which this
estimate can be used in relation to possible different sets
of radiographs and different groups of readers depends on
one's degree of belief that the collections of radiographs and
the groups of readers involved in the present experiment are
representative of the radiographs and readers about whom it
is desired to make an inference or prediction.
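
As an illustration of the calculation, a minimal Python sketch follows. The function name and the scores shown are hypothetical; they are not taken from the data sheets.

```python
def proportion_within_one(reader_scores, consensus_scores):
    """Proportion of a reader's scores lying within 1 degree of the consensus."""
    hits = sum(
        1
        for score, cons in zip(reader_scores, consensus_scores)
        if abs(score - cons) <= 1
    )
    return hits / len(reader_scores)

# Hypothetical scores for four radiographs: the third differs
# from the consensus by two degrees, the others by at most one.
reader = [1, 2, 5, 3]
cons = [1, 3, 3, 3]
print(proportion_within_one(reader, cons))
```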

From the estimated proportion and the number of readings
made, one may calculate a lower confidence limit for the
average proportion of readings within ± 1 of consensus. Thus,
one may assert that (at the 0.975 probability level, say) the
expected proportion of readings by Reader A within ± 1 of
consensus is not less than a specified number. The confidence
limits given in Table I are approximations based on the
simplifying assumption that the probability of interpreting
within ± 1 of consensus is the same for each specimen of a
given alloy. Table Ia gives the proportion of readings
within ± 1 of consensus at each consensus value.
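
The limits in Table I were read from published tables. For readers who wish to compute a comparable quantity directly, the sketch below finds an exact one-sided lower limit for a binomial proportion by bisection. It is offered as an assumption-laden substitute, not as the method used for Table I, and it relies on the same simplifying assumption of a constant probability for every specimen. The function names are illustrative.

```python
from math import comb

def binom_tail_ge(k, n, p):
    """P(X >= k) for X distributed Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def lower_confidence_limit(k, n, level=0.975):
    """One-sided lower confidence limit for a proportion, given k hits in n trials.

    Solves P(X >= k | p) = 1 - level for p by bisection; the tail
    probability increases with p, so the root is unique.
    """
    if k == 0:
        return 0.0
    alpha = 1 - level
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid  # p too small to make k hits this likely
        else:
            hi = mid
    return lo

# Hypothetical example: 20 readings, 19 of them within 1 degree of consensus.
print(round(lower_confidence_limit(19, 20), 3))
```

Values obtained this way need not agree to the last digit with limits read from a chart.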

For reasons discussed below (Section 5), the first
readings were used for this analysis; revised scores were
ignored.


TABLE I

Proportion of Scores Within ± 1 Degree of Consensus

                       Proportion of Scores    0.975 Confidence
          Number of    Within ± 1 Degree of    Limit for Expected
Reader    Readings     Consensus               Proportion*

195 Alloy
  A         422           1.000                   .99
  B         422            .967                   [?]
  C         422            .922                   [?]

355 Alloy
  A         447            .991                   .98
  B         449           1.000                   .99
  C         449            .998                   .99

Dow H (Transverse)
  A         371            .965                   [?]
  B         371            .957                   .93
  C         371            .980                   .93

Dow H (Longitudinal)
  A         327            .919                   .96
  B         327            .887                   .85
  C         327            .985                   .96

*) Obtained from Table 41 in E. S. Pearson and H. O. Hartley
(ed.), Biometrika Tables for Statisticians, Volume I,
Cambridge University Press (19[?]).


TABLE Ia

Proportion of Scores Within ± 1 Degree of Consensus

             Number of               Reader
Consensus    Readings        A         B         C

195 Alloy
    1           66          1.00      1.00      [?]
    2           66          1.00      1.00      1.00
    3           54          1.00      1.00      1.00
    4           42          1.00      [?]       1.00
    5           38          1.00       .87       .95
    6           47          1.00      1.00      1.00
    7           46          [?]       [?]       [?]
    8           63          1.00      1.00      1.00

355 Alloy
    1           85          1.00      1.00      1.00
    2          112*          .9[?]    1.00      1.00
    3          123          1.00      1.00      1.00
    4           76          1.00      1.00       .99
    5           49          1.00      1.00      1.00
    6            4          1.00      1.00      1.00

Dow H (Transverse)
    1           48          1.00      1.00      1.00
    2           62           .97      1.00       .9[?]
    3           69           .93       .96       .94
    4           82           .96      [?]        .9[?]
    5           66           .98       .9[?]     .97
    6           44          1.00       .95       .9[?]

Dow H (Longitudinal)
    1           56          1.00      1.00      1.00
    2           55          1.00       .98      1.00
    3           54          [?]       [?]       [?]
    4           56          [?]       [?]       [?]
    5           49          1.00       .82       .98
    6           57          1.00      [?]        .9[?]

*) For Reader A there are only 110 readings.


3. Differences Among Readers.

It must be said first that it is not possible to make
an extensive analysis of observed differences among readers,
since the experiment did not provide data permitting
comparison of differences among readers with differences among
repeated readings of the same radiograph by each reader.

One type of over-all picture of the differences among
readers is given by Figures 1 through 4, which are based on
Table II. Figure 1, for example, shows for each reader the
distribution among the eight degrees of scores assigned to the
422 specimens of 195 Alloy; apparently Reader C may have a
tendency to avoid assigning degrees 1 and 2 to radiographs of
195 Alloy specimens. Table II and Figures 1 through 4 are
based on first readings.

In examining Figures 1 through 4, it must be remembered
that the identity of a reader is not the same in all figures;
the identities recorded on the worksheets are:

                        Reader A    Reader B    Reader C

195 Alloy               Pierce      LJF         Criscuolo
355 Alloy               Polansky    IJF         Cris. or JDG
Dow H (Transverse)      Polansky    IP          IDG
Dow H (Long.)           Polansky    IJF         JDG


These figures suggest that for some purposes the differences
among interpretations of radiographs by different readers
may be substantial. For example, if material is to be accepted
or rejected according as the degree of defect (porosity or
microshrinkage) is greater or less than a specified amount, it
could make quite a lot of difference if a reader had a marked
tendency to avoid assigning scores at the endpoints of the
scale.

It may be recorded that for every alloy the over-all
distributions for the three readers are (statistically)
significantly different. The same sort of conclusion is reached
in other ways; e.g., for Dow H alloy Reader B assigns scores
lower than Reader A for a significantly larger proportion of
the radiographs on which they disagree. Whether these
differences are practically meaningful or not must be
determined in the context of the use which is to be made of
the degree scores.
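
The report does not state which test was used for the comparison of Reader A and Reader B. One conventional choice for a comparison restricted to the radiographs on which two readers disagree is the sign test, sketched below in Python as an assumption. The function name and the counts in the example are hypothetical.

```python
from math import comb

def sign_test_p_value(n_lower, n_higher):
    """Two-sided sign test on the disagreements between two readers.

    n_lower:  radiographs on which the second reader scored lower
    n_higher: radiographs on which the second reader scored higher
    Agreements (ties) are left out before counting.
    """
    n = n_lower + n_higher
    k = max(n_lower, n_higher)
    # Probability of a split at least this uneven if either direction
    # of disagreement were equally likely.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical counts: lower on 40 radiographs, higher on 15.
print(sign_test_p_value(40, 15))
```

A small value indicates that the disagreements fall predominantly in one direction, which is the kind of statement made above for Dow H alloy.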


TABLE II

Distribution of Alloy Specimens According
to Degree Scores Assigned by Individual
Readers and by Consensus of Three Readers

                                   Degree
Reader            1      2      3      4      5      6      7*     8*

195 Alloy
  A              72     81     39    [?]     37    [?]     42     63
  B              65     63     51     33     42     40     62     66
  C              26     50    105     59   [?]7   [?]6     29     60
  Consensus      66     66     54     42     38     47     46     63

355 Alloy
  A              86    106    106     77     51     21      -      -
  B**            85    [?]    [?]     75   [?]7      0      -      -
  C**            85    120    [?]     75   4[?]      7      -      -
  Consensus**    85    110    123     76     49      4      -      -

Dow H (Transverse)
  A              43     54     60     74     69    [?]      -      -
  B              53   [?]3     87   [?]2     51     35      -      -
  C              56     61     55    [?]     59     58      -      -
  Consensus      48     62     69     82     66     44      -      -

Dow H (Longitudinal)
  A             [?]    [?]     59    [?]    [?]     78      -      -
  B             [?]    [?]    [?]    [?]     29    [?]      -      -
  C            5[?]     53   [?]6     61   [?]8     61      -      -
  Consensus      56     55     54     56     49     57      -      -

*) Degrees 7 and 8 are used only for 195 Alloy.

**) There are two lines in the worksheets where readings are
missing for Reader A. These two cases are omitted in
this tabulation, so that each row shows the distribution
of 447 readings.
Figure 1. Distribution of Readings, 195 Alloy

Figure 2. Distribution of Readings, 355 Alloy

Figure 3. Distribution of Readings, Dow H (Transverse)

Figure 4. Distribution of Readings, Dow H (Longitudinal)


4. Variability of Interpretations.


A measure of the amount of variability among readers of
the same radiograph is the range of scores assigned to that
radiograph, that is, the difference between the largest and
smallest of the three scores. Table III shows for each degree
the distribution of these ranges for specimens having consensus
score at that degree. To summarize, the table also shows the
proportion of cases for which the range was less than or
equal to one, that is, the proportion of cases in which either
all three readers agreed or two readers agreed and the third
reader assigned an adjacent score. The distribution for all
readers combined is also shown. Table III is based on first
readings.
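
The range and the summary proportion may be computed as in the following minimal Python sketch. The function name and the example scores are hypothetical.

```python
from collections import Counter

def range_summary(triples):
    """Tabulate ranges of three scores and the proportion with range <= 1.

    triples: one (score_A, score_B, score_C) tuple per radiograph.
    Returns (Counter mapping range -> number of radiographs, proportion).
    """
    ranges = [max(t) - min(t) for t in triples]
    counts = Counter(ranges)
    proportion = sum(1 for r in ranges if r <= 1) / len(ranges)
    return counts, proportion

# Hypothetical scores for four radiographs with the same consensus.
counts, proportion = range_summary([(3, 3, 3), (3, 4, 3), (1, 3, 5), (3, 2, 3)])
print(dict(counts), proportion)
```

Applying the function separately to the radiographs at each consensus degree would give one row of a table laid out like Table III.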

The data provide no way to distinguish between the two
possible sources of this observed variability: differences
among readers and variability of interpretations by one
reader.

A third factor affecting variability of interpretation
is that for radiographs exhibiting an extreme amount of defect
(at either end of the scale) the readers will necessarily tend
to give scores that are closer together, since the scoring
method does not permit assignment of scores beyond the range
represented by comparison standards. The effect of this factor
is seen in Table III, where it is evident that the wider
ranges of scores tend to occur more frequently when the
consensus is nearer the middle of the scale.

Another measure of variability, which is more appropriate
for comparing sources of variability, is the mean
difference (see Sections 5 and 6 below).


TABLE III

Distribution of Range of Scores Assigned
by Three Observers

                                                            Proportion of
             Number of          Range of Scores             Readings with
Consensus    Readings       0      1      2      3      4   Range ≤ 1

195 Alloy
    1           66         23     19     22      2      -      .64
    2           66         16     46      4      -      -      .94
    3           54         10     38      6      -      -      .89
    4           42         15     19      2      6      -      .81
    5           38         12     17      7      2      -      .76
    6           47          9     32      6      -      -      .87
    7           46          9     25     12      -      -      .74
    8           63         54      9      -      -      -     1.00
  Total        422        148    205     59     10      0      .84

355 Alloy
    1           85         85      -      -      -      -     1.00
    2          110*        69     37      4      -      -      .96
    3          123         52     69      2      -      -      .98
    4           76         24     47      5      -      -      .93
    5           49         16     33      -      -      -     1.00
    6            4          -      4      -      -      -     1.00
  Total        447        246    190     11      0      0      .98

Dow H (Transverse)
    1           48         37     11      -      -      -     1.00
    2           62        [?]    [?]    [?]    [?]    [?]      [?]
    3           69       1[?]     40     12      2    [?]      .7[?]
    4           82         21     32     22    [?]    [?]      .65
    5           66        [?]    [?]    [?]    [?]    [?]      .65
    6           44         26     15      3      -      -      .93
  Total        371        128    163     61     15      4      .78

Dow H (Longitudinal)
    1           56         36     20      -      -      -     1.00
    2           55        [?]    [?]    [?]    [?]    [?]      .98
    3           54         10     35      9      -      -      .83
    4           56        [?]    [?]    [?]    [?]    [?]      .70
    5           49          5     16     25      3      -      .43
    6           57         14     20     20      3      -      .60
  Total        327         91    158     67      9      2      .76

*) Two cases for which there were only two readings are
omitted, since this is a tabulation of ranges of scores
by three observers.


5. Analysis of Repeated Readings.

For the Dow H alloy, repeated readings were obtained in most
cases where the range of the three first readings was two or
more. The reasons for using the first readings in the
foregoing analyses are the following.

First, revised readings were obtained only in cases where
the range of first readings was large. Suppose, however, that
the chance of a wide range among the three scores were about the
same for every radiograph. Then one would expect that the
ranges of the repeated readings would tend to be smaller than
the ranges of first readings for the same radiographs if only
the wide-range cases were repeated. For if the second readings
were independent of the first, one would expect to find the
distribution of ranges for second readings similar to the
distribution of ranges among all first readings. Table IV indicates


that these expectations are roughly realized. Accordingly,
use of the revised readings might lead to underestimation
of the amount of variability inherent in the process of
interpreting radiographs, and to overestimation of the
proportion of readings within ± 1 degree of consensus. The
amount of difference that is involved may be illustrated by
the following quantities calculated for Dow H (Transverse).
In calculating the second column, first readings were
replaced by second readings wherever available.

                            First Readings    Revised Readings
Proportion Within
± 1 of Consensus
    A                            .96               .99
    B                            .96               .98
    C                            .96               .98
Proportion of Ranges
≤ 1                              .78               .88


The tendency for the ranges of second readings to be
less than those of the first readings is seen even more
sharply as follows. The difference between first and second
ranges for each radiograph is positive (i.e., the second
range is smaller) in an overwhelming number of cases.

                   First Range Minus Second Range

Dow H Alloy       -1      0      1      2      3      4
Transverse         -     10     25     16      3      1
Longitudinal       2     23     33     10      1      -
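
The selection effect described under the first reason can be illustrated by a small simulation. The sketch below is purely hypothetical: the error model (each reader's score equal to a true degree plus an error of -1, 0, or +1, held within a six-degree scale) is an assumption made for illustration and is not estimated from the data.

```python
import random

def simulate(n_radiographs=2000, seed=0):
    """Compare ranges of first and independent second readings when only
    radiographs with a wide first range (2 or more) are re-read."""
    rng = random.Random(seed)
    errors = [-1, 0, 0, 1]  # hypothetical reading error, in degrees

    def read(true_degree):
        # One reader's score, held within the 1 to 6 scale.
        return min(6, max(1, true_degree + rng.choice(errors)))

    def score_range(true_degree):
        scores = [read(true_degree) for _ in range(3)]
        return max(scores) - min(scores)

    truth = [rng.randint(1, 6) for _ in range(n_radiographs)]
    first = [score_range(t) for t in truth]
    second = [score_range(t) for t in truth]  # independent of the first

    selected = [i for i, r in enumerate(first) if r >= 2]

    def mean(values):
        return sum(values) / len(values)

    return {
        "all_first": mean(first),
        "selected_first": mean([first[i] for i in selected]),
        "selected_second": mean([second[i] for i in selected]),
    }

print(simulate())
```

Under this model the second ranges on the selected radiographs are smaller on average than the first ranges, although no reader has changed in any way between the two readings.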


Second, the revised readings may not be strictly
comparable to first readings. The readers knew when making
second readings that their first readings had had a wide
range. They may in some cases have been able to recall the
score assigned in the first reading, and the direction (below
or above) in which this first score deviated from the median
of first readings.

The repeated readings might have afforded data for
comparison of variability among readings by one reader with
differences among readers. The two considerations discussed
above, however, make this impossible. Both have the effect
of providing data on variability between readings by one
reader under conditions which lead to greatest variability.

The mean (absolute) difference is appropriate in this
context as a measure of variability. The mean difference among
readers is calculated by summing the three absolute
differences for each radiograph, and dividing the total by three
times the number of radiographs. It may be recorded that
the mean differences among readers for first readings were
as follows.

Alloy               Mean Difference

195                     [?]
355                     [?]
Dow H (trans.)          .62
Dow H (long.)           .67


The mean difference among readings is calculated by
summing for each reader the absolute differences between his
first and second readings and dividing by the number of cases
for which there were two readings.

For Dow H Alloy, the mean differences between first
and second readings of the same radiograph were as follows:

              Mean Difference Between Readings

Reader             A        B        C      Combined
Transverse        [?]      1.25      .95      1.02
Longitudinal      .83      [?]       .8[?]    [?]
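
The two mean differences defined above may be computed as in the following minimal Python sketch. The function names and the example scores are hypothetical; a missing second reading is represented here by `None`.

```python
from itertools import combinations

def mean_difference_among_readers(triples):
    """Sum the three absolute pairwise differences for each radiograph and
    divide by three times the number of radiographs."""
    total = sum(abs(x - y) for t in triples for x, y in combinations(t, 2))
    return total / (3 * len(triples))

def mean_difference_between_readings(first, second):
    """Mean absolute difference between one reader's first and second
    readings, over the cases for which there were two readings."""
    pairs = [(f, s) for f, s in zip(first, second) if s is not None]
    return sum(abs(f - s) for f, s in pairs) / len(pairs)

# Hypothetical scores.
print(mean_difference_among_readers([(2, 2, 2), (1, 2, 4)]))
print(mean_difference_between_readings([3, 4, 5], [2, None, 5]))
```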


The data on repeated readings might be used to provide
information on the variability of the consensus, were it not
again for the difficulties mentioned above. For example, we
may compare the mean (absolute) difference between consensuses
of first and second readings with the mean difference between
individual readers' first and second readings.

Dow H Alloy                  Mean Difference

                  Between Consensuses    Between Readings
Transverse              .[?]2                 1.01
Longitudinal            .43                    .64

As expected, the consensus is less variable than individual
readings. For the reasons noted above, however, these are
undoubtedly overestimates of the variability inherent in the
interpretation process.

The "reliability" of the consensus may be represented by
the proportion of cases in which the consensus of first
readings and the consensus of second readings differ by at
most one degree. This is found to be 0.8[?] for Dow H
(transverse) and 0.93 for Dow H (longitudinal).
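
A minimal Python sketch of this "reliability" calculation follows, taking the consensus to be the median as defined in Section 1. The function name and the example scores are hypothetical.

```python
from statistics import median

def consensus_reliability(first_triples, second_triples):
    """Proportion of radiographs for which the consensus (median) of the
    first readings and that of the second readings differ by at most one."""
    close = sum(
        1
        for first, second in zip(first_triples, second_triples)
        if abs(median(first) - median(second)) <= 1
    )
    return close / len(first_triples)

# Hypothetical first and second readings of three radiographs.
first = [(1, 2, 3), (2, 4, 6), (5, 5, 6)]
second = [(2, 2, 3), (1, 2, 2), (5, 6, 6)]
print(consensus_reliability(first, second))
```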
In a few cases, a third reading was obtained. There
are not enough such cases to permit satisfactory statistical
analysis.


TABLE IV

Distribution of Ranges of Three Readings, for First and
Second Readings on the Same Radiographs, Compared
With Distribution of Ranges for All First Readings

Set of         Number of               Range
Readings       Readings       0      1      2      3      4

Dow H (Transverse)
First*            55**        -      -     37     14      4
Second            55        [?]    [?]     18    [?]      -
All First        371        128    163     61     15      4

Dow H (Longitudinal)
First*            69          -      -     58      9      2
Second            69          7    [?]     27      4    [?]
All First        327         91    158     67      9      2

*) First readings for radiographs for which there is also a
second reading.

**) One case where only two readers made second readings is
omitted.


6. Suggestions for Further Experiments.

In this section are noted, first, some possibilities for
additional experiments which would contribute to more
satisfactory analysis of the present experiment; and second,
further possibilities for experiments which might be useful
if it were desired to answer certain additional questions
about the process of radiograph interpretation.

It would be highly desirable to obtain independent second
readings of at least some of the radiographs used in the
present experiment, by the same readers. Such data are
essential if it is desired to make a valid statistical
statement about the variability of the consensus. Such data would
also provide a basis for deciding whether the differences
among readers are important relative to variability among
successive readings by one reader. Independent second readings
would be obtained by repeating the experimental process for
the whole set of radiographs or for an appropriately selected
random sample, so that obtaining a second reading does not
depend on the outcome of the first readings. Furthermore,
the readers should, while making the second reading, have no
way of knowing what the outcome of the first reading was. In
an experiment where first and second readings are made within
a short time interval, precautions must be taken to meet this
latter condition as nearly as possible.
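
One way of arranging such independent second readings is sketched below in Python. The sketch is an assumption for illustration only: it draws a random sample of radiographs without regard to the first readings and presents the sample to each reader in a separate random order. The function name, the reader labels, and the identifying numbers are hypothetical.

```python
import random

def plan_second_readings(radiograph_ids, sample_size, readers=("A", "B", "C"), seed=None):
    """Choose radiographs for second readings independently of the first
    readings, and give each reader an independently shuffled order."""
    rng = random.Random(seed)
    sample = rng.sample(list(radiograph_ids), sample_size)
    plan = {}
    for reader in readers:
        order = sample[:]  # same radiographs for every reader
        rng.shuffle(order)  # presentation order differs by reader
        plan[reader] = order
    return plan

# Hypothetical example: 20 of 100 radiographs to be read again.
plan = plan_second_readings(range(1, 101), 20, seed=1)
print(plan["A"])
```

Shuffling the order is intended only to make recall of the first score less likely; it does not by itself ensure that the latter condition is met.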

A completed analysis of this experiment would then
provide estimates of

(1) Expected proportion of readings within ± 1 degree
of consensus.

(2) Expected proportion of cases in which the consensuses
of two independent sets of readings differ
by at most one degree.

(3) Mean difference between readers.

(4) Mean difference between two independent readings
by one reader.

One would also obtain an idea of the relative magnitude of
the two main sources of variability, reader differences and
uncertainty of interpretation. If reader differences appear
to be important, one could further determine what the biases
of individual readers are (e.g., tendency to avoid ends of
scale, tendency to assign lower scores).

Representativeness of radiographs.

A new experiment conducted in the same fashion as the
present one (with independent repeated readings), but using
some of the present radiographs and some new ones, would
provide information as to the possibility of generalizing the
application of the results of this experiment.

Representativeness of readers.

Similarly, the above experiment might be expanded by
being duplicated with three new readers.

Effect of extreme points of scale.

An experiment might be conducted with different rules,
namely, with the possibility of assigning an extreme specimen
to a "degree" beyond the end of the scale.